Audiovisual Speech Recognition with Articulator Positions as Hidden Variables

Authors

Mark Hasegawa-Johnson

Karen Livescu

Partha Lal

Kate Saenko

Abstract

Speech recognition, by both humans and machines, benefits from visual observation of the face, especially at low signal-to-noise ratios (SNRs). It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony between the audible phoneme and the visible viseme outperform recognizers that allow no such asynchrony. This paper proposes, and tests using experimental speech recognition systems, a new explanation for audio-visual asynchrony. Specifically, we propose that audio-visual asynchrony may be the result of asynchrony between the gestures implemented by different articulators, such that the most visibly salient articulator (e.g., the lips) and the most audibly salient articulator (e.g., the glottis) may, at any given time, be dominated by gestures associated with different phonemes. The proposed model of audio-visual asynchrony is tested by implementing an “articulatory-feature model” audiovisual speech recognizer: a system with multiple hidden state variables, each representing the gestures of one articulator. The proposed system performs as well as a standard audiovisual recognizer on a digit recognition task; the best results are achieved by combining the outputs of the two systems.
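The idea of one hidden state variable per articulator can be illustrated with a minimal sketch. It is not the authors' implementation: the articulator names, the use of a phoneme index as each articulator's state, and the `max_async` bound on how far the articulators may drift apart are all assumptions introduced here for illustration.

```python
from itertools import product

def joint_states(num_phones, articulators, max_async):
    """Enumerate joint hidden states for a toy articulatory-feature model.

    Each articulator carries its own hidden variable: the index of the
    phoneme whose gesture currently dominates it. A joint state is kept
    only if the indices differ by at most max_async (a hypothetical
    asynchrony limit, not taken from the abstract).
    """
    states = []
    for idx in product(range(num_phones), repeat=len(articulators)):
        if max(idx) - min(idx) <= max_async:
            states.append(dict(zip(articulators, idx)))
    return states

# Hypothetical articulator set for a three-phoneme word.
ARTICULATORS = ["lips", "tongue", "glottis"]

# max_async=0 forces all articulators onto the same phoneme.
sync = joint_states(3, ARTICULATORS, max_async=0)

# max_async=1 lets one articulator lead or lag the others by one phoneme.
loose = joint_states(3, ARTICULATORS, max_async=1)
```

With `max_async=0` the joint state space reduces to one state per phoneme, as in a recognizer that allows no asynchrony. Relaxing the bound admits states such as the lips already producing the second phoneme while the glottis is still on the first, which is the kind of configuration the abstract proposes as the source of audio-visual asynchrony.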

Related articles

An articulation model for audiovisual speech synthesis - Determination, adjustment, evaluation

The authors present a visual articulation model for speech synthesis and a method to obtain it from measured data. This visual articulation model is integrated into MASSY, the Modular Audiovisual Speech SYnthesizer, and used to control visible articulator movements described by six motion parameters: one for the up-down movement of the lower jaw, three for the lips and two for the tongue. The v...

Visual information and redundancy conveyed by internal articulator dynamics in synthetic audiovisual speech

This paper reports results of a study investigating the visual information conveyed by the dynamics of internal articulators. Intelligibility of synthetic audiovisual speech with and without visualization of the internal articulator movements was compared. Additionally speech recognition scores were contrasted before and after a short learning lesson in which articulator trajectories were expla...

A Stochastic Articulatory-to-acoustic Mapping as a Basis for Speech Recognition

Hidden Markov models (HMMs) of speech acoustics are the current state-of-the-art in speech recognition, but these models bear little resemblance to the processes underlying speech production (Lee, 1989). In this respect, using an HMM to model speech acoustics is like using a Gaussian distribution to model data generated by a Poisson process – to the extent that the model is not an accurate repr...

Improving on Hidden Markov Models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve o...

A Hybrid HMM/BN Acoustic Model for Automatic Speech Recognition

In current HMM based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow for easy combination of different continuous as well as discrete features by exploring conditional dependencies between them. However, the lack of efficient algorit...

Publication date: 2007
